Towards a Real-time Transient Classification Engine
Temporal sampling does more than add another axis to the vector of observables. Instead, under the recognition that how
objects change (and move) in time speaks directly to the physics underlying astronomical phenomena, next-generation
wide-field synoptic surveys are poised to revolutionize our understanding of just about anything that goes bump in the night
(which is just about everything at some level). Still, even the most ambitious surveys will require targeted spectroscopic
follow-up to fill in the physical details of newly discovered transients. We are now building a new system intended to
ingest and classify transient phenomena in near real-time from high-throughput imaging data streams. Described herein,
the Transient Classification Project at Berkeley will make use of classification techniques operating on "features"
extracted from time series and contextual (static) information. We also highlight the need for community adoption of a
standard representation of astronomical time series data (i.e., "VOTimeseries").
1 Introduction

Classification and knowledge extraction from large imaging
surveys of the static sky is a maturing endeavor. Star-galaxy
classification and photometric redshift estimation are
well-posed problems where knowledge of the underlying physics
allows for robust estimates of uncertainty. Automated
classification is critical for source demographics, especially
for informing large statistical questions of the data. More
recently, autonomous classification in time-domain surveys has
had some notable successes, particularly with microlensing
surveys 12 and supernova searches 4,5,9. Indeed, optimizing
reductions, survey cadence strategies, and software algorithms
in the search for a specific class of phenomena has been the
most straightforward use of synoptic imaging (see Bailey, this
workshop). While interested primarily in supernovae, the Deep
Lens Survey (2; see Becker, this workshop) is one of the few
large-field synoptic surveys that attempted to provide
multi-class inferences in real time. Still, those
classifications were rather broad-brush in physical scope
("supernova", "fast moving", "slow moving", "variable star",
"unknown (stationary)"). With the advent of large-scale
multicolor imaging surveys to appreciable depths (e.g., DES,
Pan-STARRS1, SkyMapper, and LSST), the need for a real-time
and general classification scheme for astronomical transients
is particularly pressing.

The challenges in time-domain classification may be subdivided
into "discovery" and "inference." Discovery of a variable or a
moving source requires at least two images with the same
filter system. The characteristic separation in time will
dictate the types of sources that are seen to vary: taken too
close in time, moving solar system sources do not change their
apparent angular position enough for the change to be
recognized; taken too far apart in time, slowly moving solar
system sources would appear as single-apparition point sources
and be confused with extra-solar events. Without the benefit
of filtering techniques that use several images of a static
sky, transient discovery on a few images is particularly prone
to cosmetic defects in the imaging arrays, cosmic rays, and
near-field (non-astrophysical) interlopers 1 . Transient
discovery can be performed either in "catalog space" (noting
significant changes of source brightness between two epochs)
or with source extraction in image differences 1. The former
is generally less computationally intensive but more
susceptible to error due to variable seeing and imaging array
defects. Image differencing is generally more robust in
crowded fields and where transients are embedded in galaxy
light.

1 Having more than one concurrent image from at least two
sites greatly reduces false positives (as employed with the
RAPTOR and TAOS experiments).

Once a transient source is discovered, we wish to surmise the
nature of the source. We identify the major features of a
general classification scheme:

- The inferences about the physical nature of the source
(and the source variability) should make full use of prior
knowledge about transients without coercing every new
transient into a predefined set of classes. Different users
should be allowed to tune their priors as the science
requires.

- The inferences will necessarily be probabilistic in nature
and should evolve in time as more observations are obtained.

- The classifications should be as near real-time as possible
to allow appropriate follow-up.

- The classification system should accept feedback from end
users and adapt the classification algorithms accordingly.

We are now building a framework that will be capable of
classifying transient sources from time-domain surveys with
these features. Similar work in a machine-learning context
has been reported elsewhere (e.g., 13, 14 and Mahabal, this
workshop; Bailey, this workshop). There are several
components to this "Transient Classification Project" (TCP),
described herein. We aim to have a working system in place
by the time that the Palomar Transient Factory (PTF; 8)
comes on-line in Fall 2008.
2 Data Ingest

The starting point of transient classification, from the
perspective of the TCP, is a stream of metadata describing the
individual detections (either from image differences or
catalog detections). We coerce this metadata stream to an
internal data model using a custom translation client written
for each survey. Since we have been developing the TCP
primarily using the public data from the SDSS-II Stripe 82
survey 7, we found it necessary to recalibrate the photometry
and astrometry from the raw detection files. Objects from all
surveys are ingested into a relational database with the
object positions indexed using the hierarchical triangle mesh
(HTM; 10) at depths of 14 and 25 to allow for fast and
accurate searching.

Object detections need to be associated with astrophysical
sources. Unlike with static or after-the-fact time-domain
surveys, where a filtered deep sky image may be used as the
fiducial "true" representation of source brightnesses and
source positions, a real-time time-domain survey must
necessarily associate each object detection with an
astrophysical source. We create sources on-the-fly using a
probabilistic framework that asks whether a new object belongs
to an existing source or demands the creation of a new source.
For a new object with detected position
($O_\alpha \pm \sigma_{O,\alpha}$, $O_\delta \pm \sigma_{O,\delta}$)
we can find a set of possible associated sources {S} by
searching in the source catalog for sources with positions
within angular distance 2
$d < d_0 \sqrt{\sigma_{O,\alpha}\,\sigma_{O,\delta}}$ of O;
typically we use $d_0 \approx 10$. For each source $S_i$ in
{S} we compute the logarithm of the odds ratio,
$\ln \mathcal{O}_i$, comparing the hypotheses that the object
belongs to that source or should be a separate source. Under
the assumption of Gaussian [...]
with $S_{\alpha,\delta}$ equal to the weighted mean position
of O and $S_i$. Since we expect $-2 \ln \mathcal{O}_i$ to be
distributed as $\chi^2$ with 2 degrees of freedom, we
associate the object with the $S_i$ meeting some predetermined
probability threshold. As the number of objects associated
with a source grows, the number of lower probability
associations should also grow; through a simulation, we have
found that the correct number of sources is created if we
change the probability threshold in accordance with the number
of sources already associated with that object.

2 This is quite robust if the uncertainties of the source
positions in the current source catalog are similar (typically
<1") and comparable to or less than the object positional
uncertainties.

3 Source Classification
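
As an illustration of the source-association step described
under Data Ingest, the following is a minimal Python sketch,
not the TCP implementation. The full odds-ratio expression is
not reproduced in the text, so the sketch assumes independent
Gaussian positional errors and uses the error-normalized
squared offset between object and source as the statistic
compared against a chi-square distribution with 2 degrees of
freedom. The function names, the tuple layout, and the
threshold `p_min` are hypothetical.

```python
import math


def chi2_offset(obj, src):
    """Error-normalized squared offset between an object and a source.

    obj and src are (ra_deg, dec_deg, sigma_ra_arcsec, sigma_dec_arcsec).
    Assumes independent Gaussian errors and small separations
    (no handling of the RA wrap at 0/360 degrees).
    """
    ra_o, dec_o, sra_o, sdec_o = obj
    ra_s, dec_s, sra_s, sdec_s = src
    cos_dec = math.cos(math.radians(0.5 * (dec_o + dec_s)))
    d_ra = (ra_o - ra_s) * cos_dec * 3600.0   # arcsec on the sky
    d_dec = (dec_o - dec_s) * 3600.0          # arcsec
    return (d_ra ** 2 / (sra_o ** 2 + sra_s ** 2)
            + d_dec ** 2 / (sdec_o ** 2 + sdec_s ** 2))


def association_probability(chi2):
    """Survival function of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-0.5 * chi2)


def associate(obj, candidates, p_min=0.01):
    """Return the index of the best-matching candidate source, or None.

    None means no candidate met the (assumed) probability threshold,
    i.e. the object would seed a new source.
    """
    best_index, best_p = None, p_min
    for i, src in enumerate(candidates):
        p = association_probability(chi2_offset(obj, src))
        if p >= best_p:
            best_index, best_p = i, p
    return best_index


sources = [(10.0, 0.0, 0.1, 0.1), (10.01, 0.0, 0.1, 0.1)]
match = associate((10.0, 0.0, 0.1, 0.1), sources)
no_match = associate((10.005, 0.0, 0.1, 0.1), sources)
```

A threshold that varies with the number of detections already
attached to a source, as suggested by the simulation mentioned
above, could be passed in through `p_min` on a per-call basis.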



3.1 Representation 

The creation or modification of a source triggers a series of
steps that will lead to an updated statement about the nature
of that source. In an effort to modularize the software tasks,
and to prepare for the possibility of distributing the
computational tasks (see Section 4), we have built a portable
(XML-based) source container (which we call "VOSource"),
consisting of a rudimentary source position, results from
survey queries (such as NED), and the time series photometry
associated with the source. The time series (which we call
"VOTimeseries") is marked up as a VOTable 3 , similar to the
way in which time series data are represented in the VizieR 4
catalogs. An example of a "VOSource" container is:

<VOSOURCE version="0.01">
  <COOSYS equinox="J2000" epoch="J2000" system="eq_FK5"/>
  <dbID>62</dbID>
  <WhereWhen>
    <Description>current position</Description>
    <Position2D unit='deg'>
      <!-- same as VOEvent positions -->
    </Position2D>
  </WhereWhen>
  <VOTIMESERIES version="0.1">
    <TIMESYS>
      <TimeType ucd="frame.time.system?">MJD</TimeType>
      <TimeSystem ucd="frame.time.scale">UTC</TimeSystem>
      <TimeRefPos ucd="pos;frame.time">TOPOCENTER</TimeRefPos>
    </TIMESYS>
    <RESOURCE name='db photometry'>
      <TABLE name='sdss-i'>
        <FIELD name='t' ID='col1' system='TIMESYS'
               datatype='float' unit='day'/>
        <FIELD name='m' ID='col2'
               ucd='phot.mag;em.opt.i'
               datatype='float' unit='mag'/>
        <FIELD name='m_err' ID='col3'
               ucd='stat.error;phot.mag;em.opt.i'
               datatype='float' unit='mag'/>
        <DATA><TABLEDATA>
          <TR row='1'><TD>53352.1454934000</TD>
              <TD>2.0757000000</TD>
              <TD>0.0749128000</TD></TR>
        </TABLEDATA></DATA>
      </TABLE>
    </RESOURCE>
  </VOTIMESERIES>
</VOSOURCE>

3 http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOTable

4 http://vizier.u-strasbg.fr/viz-bin/VizieR
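
To illustrate how such a container might be consumed
downstream, here is a minimal Python sketch that reads the
photometry rows of a VOTimeseries table with the
standard-library `xml.etree.ElementTree` module. The embedded
document is a cleaned, closed-off fragment modeled on the
example above; the function name `read_timeseries` is
hypothetical, and the schema is only as shown in the example,
not a published standard.

```python
import xml.etree.ElementTree as ET

# Cleaned fragment modeled on the VOSource example; closing tags are assumed.
EXAMPLE = """\
<VOSOURCE version="0.01">
  <dbID>62</dbID>
  <VOTIMESERIES version="0.1">
    <RESOURCE name="db photometry">
      <TABLE name="sdss-i">
        <FIELD name="t" ID="col1" datatype="float" unit="day"/>
        <FIELD name="m" ID="col2" datatype="float" unit="mag"/>
        <FIELD name="m_err" ID="col3" datatype="float" unit="mag"/>
        <DATA><TABLEDATA>
          <TR row="1"><TD>53352.1454934000</TD>
              <TD>2.0757000000</TD>
              <TD>0.0749128000</TD></TR>
        </TABLEDATA></DATA>
      </TABLE>
    </RESOURCE>
  </VOTIMESERIES>
</VOSOURCE>
"""


def read_timeseries(xml_text):
    """Return {table name: [row dict, ...]} keyed by FIELD names."""
    root = ET.fromstring(xml_text)
    tables = {}
    for table in root.iter("TABLE"):
        names = [field.get("name") for field in table.findall("FIELD")]
        rows = []
        for tr in table.iter("TR"):
            values = [float(td.text) for td in tr.findall("TD")]
            rows.append(dict(zip(names, values)))
        tables[table.get("name")] = rows
    return tables


series = read_timeseries(EXAMPLE)
```

Keying each row by its FIELD name keeps the downstream code
independent of column order, which may differ between surveys.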



3.2 Feature Extraction: Mixing the Time-domain with Context

The time series of a source is anything but standard: data
are irregularly sampled, noisy, sometimes spurious, and may
include detections as well as non-detections. In light of
this, we seek to homogenize the time series data by extracting
"features." A feature is a mapping of the time series onto the
real number line, often involving basic statistical metrics on
the time series (such as $\chi^2$ per degree of freedom or
skewness). We are developing a custom Python codebase that is
capable of ingesting a VOTimeseries and returning the features
of that time series. For sources with too few observations
appropriate for a given feature (such as "largest significant
peak in a Lomb-Scargle periodogram"), the result of that
specific feature extraction is reported as undefined.
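
The two statistical features named above can be sketched in a
few lines of Python. This is an illustrative sketch rather
than the TCP codebase: the function name, the dictionary keys,
the choice of the weighted mean as the reference level for
$\chi^2$, and the `min_points` cutoff are all assumptions.
Undefined features are returned as `None`.

```python
def extract_features(mags, mag_errs, min_points=3):
    """Map a light curve to a small dictionary of real-valued features.

    mags and mag_errs are equal-length sequences of magnitudes and their
    1-sigma uncertainties. Features that cannot be computed are None.
    """
    features = {"chi2_per_dof": None, "skewness": None}
    n = len(mags)
    if n < min_points:
        return features  # too few observations: everything undefined

    # Reduced chi-square about the inverse-variance weighted mean.
    weights = [1.0 / e ** 2 for e in mag_errs]
    w_mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - w_mean) / e) ** 2 for m, e in zip(mags, mag_errs))
    features["chi2_per_dof"] = chi2 / (n - 1)

    # Sample skewness from the second and third central moments.
    mean = sum(mags) / n
    m2 = sum((m - mean) ** 2 for m in mags) / n
    m3 = sum((m - mean) ** 3 for m in mags) / n
    if m2 > 0.0:
        features["skewness"] = m3 / m2 ** 1.5
    return features


sparse = extract_features([18.2, 18.3], [0.1, 0.1])
flat = extract_features([18.0, 18.0, 18.0], [0.1, 0.1, 0.1])
ramp = extract_features([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
```

Returning `None` rather than a sentinel number keeps an
undefined feature from being mistaken for a measured value by
whatever classifier consumes the feature vector.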

One of the benefits of mapping heterogeneous information to a
series of real-number lines is that the nature of that
information is abstracted away from the operations performed
on those feature vectors farther downstream. To this end,
(static) context related to a transient source (e.g., the
location with respect to the supergalactic plane, distance to
the nearest cataloged galaxy, redshift of that galaxy, etc.)
can play a powerful discriminating role. For example, a new
point source that is discovered close to the ecliptic plane
and in a region of significant Galactic extinction is much
more likely to be a slow-moving solar system object than a
distant supernova.
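
As a small example of a contextual feature, the sketch below
computes the ecliptic latitude of a source from its equatorial
coordinates using the standard rotation by the obliquity of
the ecliptic,
$\sin\beta = \sin\delta\cos\varepsilon - \cos\delta\sin\varepsilon\sin\alpha$.
The function name and the idea of storing $|\beta|$ in the
feature vector are illustrative assumptions, and the obliquity
is fixed at its approximate J2000 value.

```python
import math

OBLIQUITY_DEG = 23.4393  # approximate J2000 obliquity of the ecliptic


def ecliptic_latitude_deg(ra_deg, dec_deg):
    """Ecliptic latitude (degrees) from equatorial RA and Dec (degrees)."""
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    eps = math.radians(OBLIQUITY_DEG)
    sin_beta = (math.sin(dec) * math.cos(eps)
                - math.cos(dec) * math.sin(eps) * math.sin(ra))
    # Guard against rounding slightly outside the domain of asin.
    sin_beta = max(-1.0, min(1.0, sin_beta))
    return math.degrees(math.asin(sin_beta))


# Hypothetical context entry for a source on the ecliptic.
context = {
    "abs_ecliptic_latitude_deg": abs(ecliptic_latitude_deg(90.0, 23.4393)),
}
```

A small value of this feature would raise the prior weight of
solar system classes without the classifier needing to know
that the number came from sky geometry rather than photometry.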

3.3 Rapid Identification and Adaptation Using Machine Learning

The TCP will make use of prior knowledge of time series and
contextual information for each known class of transient. To
do so, we are assembling a large labeled training set of
real-world examples of known classes. To create a set of prior
feature distributions, these sources will be degraded and
sampled with cadences and sensitivities typical of the survey.
A new source and its subsequently derived feature vector can
be compared directly to the priors. We do not yet know what
family of classifiers will prove the most robust (see Mahabal,
this workshop, for an extensive discussion of current
techniques). However, we are exploring the use of pairwise
Naive Bayesian voting techniques as a fast approach capable of
yielding probabilistic statements about the nature of new
transients. Online learning algorithms 3 (such as "shifting
experts" 6) and ensemble algorithms (such as "boosting" 11)
should be particularly applicable, since these classes of
algorithms allow quick updates of the classification
inferences without needing to re-analyze all the available
data. For a new event that cannot be characterized by existing
classes, we are considering methods for identifying the
anomaly and incorporating it into a new class.
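
To make the idea of comparing a feature vector against
per-class prior distributions concrete, here is a minimal
Python sketch of a plain Gaussian naive Bayes classifier. It
is not the pairwise voting scheme mentioned above, which the
text does not specify; it only shows how probabilistic class
statements can be produced while skipping undefined features.
The class names, feature name, and all numerical values are
invented for illustration.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - 0.5 * ((x - mu) / sigma) ** 2)


def classify(features, class_models, priors):
    """Return posterior class probabilities for one feature vector.

    class_models maps class -> {feature name: (mean, sigma)}.
    priors maps class -> prior probability (tunable by the user).
    Features whose value is None are treated as undefined and skipped.
    """
    log_post = {}
    for cls, model in class_models.items():
        lp = math.log(priors[cls])
        for name, (mu, sigma) in model.items():
            x = features.get(name)
            if x is None:
                continue
            lp += gaussian_logpdf(x, mu, sigma)
        log_post[cls] = lp
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {cls: math.exp(v - top) / norm for cls, v in log_post.items()}


# Hypothetical two-class model with a single feature.
MODELS = {
    "supernova": {"amplitude_mag": (2.0, 0.5)},
    "variable star": {"amplitude_mag": (0.5, 0.5)},
}
PRIORS = {"supernova": 0.5, "variable star": 0.5}

posterior = classify({"amplitude_mag": 2.0}, MODELS, PRIORS)
uninformed = classify({"amplitude_mag": None}, MODELS, PRIORS)
```

Because each call needs only the current feature vector and
the stored class models, the posterior can be recomputed
cheaply every time a new observation changes the features.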

4 Future Steps 

The TCP is a work in progress, but the basic architectural
decisions have now been put in place. Aside from the
development and testing of the machine learning techniques,
there are several other elements we hope to implement in the
upcoming year:

- Multi-survey Footprint Server. Non-detections of a
transient source provide valuable constraints for
classification. A footprint server, providing upper limits for
a given position, time, and filter, will therefore be crucial
to the classification algorithms.

- Distribution. We plan to make full use of the VOEventNet
architecture to distribute newly classified transients to TCP
clients (using a variety of push/pull mechanisms).

- Feedback Mechanisms. We will require a formalized conduit
for end users (on the receiving end of probabilistic
classifications) to feed the outcome of transient follow-up
back into the system.

- Massively Distributed Computing. We are scoping the use of
the BOINC architecture 5 to create a "TCP@Home" environment,
where individual users will provide spare CPU cycles to crunch
feature extraction methods and run classification algorithms.

While especially suited for the PTF, this classification
engine is being built not only to allow several survey streams
to flow through the system but also to allow the information
extracted from each stream to inform the classifications
derived from other surveys. Since the implementation of the
feature extraction and classification algorithms is atomized,
we expect the TCP to scale well to the data rates advertised
for LSST.